In https://www.feststelltaste.de/reading-a-git-log-file-output-with-pandas/ I showed you a way to read Git log data into a Pandas DataFrame with GitPython.
Looking back, this was really difficult and tedious to do. With a few tricks, we can do it much better.
There are three new ideas that I introduce here:
This method also scales to analyzing huge Git repositories like the one for Linux. As of today, the Linux repo has almost 700,000 commits. But be warned: cloning the repo as well as retrieving the Git log information takes a while. I recommend cloning my fork, because I fixed some encoding errors in authors' names via an extended .mailmap file that maps these weird names to real names. Otherwise Python cannot read in the Git log because of a UnicodeError exception. On Windows machines, you'll also get some weird errors while cloning, but this doesn't influence the analysis.
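Such a .mailmap file maps each broken author entry to the proper one, one mapping per line: the proper name and email come first, followed by the name as it appears in the commits. A minimal sketch with made-up names and addresses:

Proper Name <proper@example.com> PrÃ¶per Name <proper@example.com>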
The first step is to connect GitPython with the Git repo. Once we have an instance of the repo, we can gain access to the underlying Git installation of the operating system via repo.git.
In this case, again, we tap the Spring Pet Clinic project, a small sample application for the Spring framework.
In [4]:
import git

# path to the cloned repository (placeholder)
GIT_LOG_FILE = r'${REPO}/spring-petclinic'
repo = git.Repo(GIT_LOG_FILE)
# direct access to the Git command line tool of the operating system
git_bin = repo.git
git_bin
Out[4]:
With git_bin, we can execute almost any Git command we like directly. In our hypothetical use case, we want to retrieve some information about the change frequency of files. For this, we need the complete history of the Git repo including statistics for the changed files (via --numstat).
We use a little trick to make sure that the format for the file's statistics fits nicely with the commit's metadata (SHA %h, UNIX timestamp %at and author's name %aN). The --numstat option provides data for additions, deletions and the affected file name in one line, separated by the tab character \t:
1\t1\tsome/file/name.ext
We use the same tab separator \t for the format string:
%h\t%at\t%aN
And here is the trick: we prepend as many tabs as the file statistics contain, plus one more, to the format string. That way, the commit metadata rows pretend to have empty file statistics columns in front of them.
The result looks like this:
\t\t\t%h\t%at\t%aN
Note: If you want to export the Git log on the command line into a file to read that file later, you need to use the tab character's escape code %x09 as separator instead of \t in the format string (as shown in the summary command below). Otherwise, the trick doesn't work.
OK, let's first execute the Git log export:
In [5]:
# export the complete Git log incl. file statistics in one call
git_log = git_bin.execute('git log --numstat --pretty=format:"\t\t\t%h\t%at\t%aN"')
git_log[:100]
Out[5]:
We've now read the complete file history into the git_log variable. Don't let all the \t characters confuse you.
Let's read the result into a Pandas DataFrame by using the read_csv method. Because we don't have a file path to CSV data but an in-memory string, we have to wrap our buffered content in a StringIO object.
Pandas will read the first line of the tab-separated "file", see the many tab-separated columns, and parse all other lines with the same column layout. Additionally, we set the header to None because we don't have one and provide nice names for all the columns that we read in.
In [6]:
import pandas as pd
from io import StringIO
commits_raw = pd.read_csv(
    StringIO(git_log),
    sep="\t",
    header=None,
    names=['additions', 'deletions', 'filename', 'sha', 'timestamp', 'author'])
commits_raw.head()
Out[6]:
The last steps are easy. We fill the empty metadata cells of the file statistics rows with the corresponding commit's metadata (a forward fill).
In [7]:
commits = commits_raw.ffill()
commits.head()
Out[7]:
And drop all the commit metadata rows that don't contain file statistics.
In [8]:
commits = commits.dropna()
commits.head()
Out[8]:
We are finished! This is it.
In summary, you just need a "one-liner" for converting a Git log file output that was exported with
git log --numstat --pretty=format:"%x09%x09%x09%h%x09%at%x09%aN" > git.log
into a DataFrame:
In [21]:
pd.read_csv("../../spring-petclinic/git.log",
sep="\t",
header=None,
names=[
'additions',
'deletions',
'filename',
'sha',
'timestamp',
'author']).fillna(method='ffill').dropna().head()
Out[21]:
As a bonus, we can now convert some columns to their proper data types. The additions and deletions columns represent the added or deleted lines of code respectively. But there are a few exceptions for binary files like images, where Git reports no line counts. We handle these entries with the errors='coerce' option. This leads to NaN values in those rows, which we drop after the conversion.
The timestamp column is a UNIX timestamp, i.e., the number of seconds elapsed since January 1st, 1970.
In [ ]:
# convert the file statistics to numbers; entries for binary files become NaN
commits['additions'] = pd.to_numeric(commits['additions'], errors='coerce')
commits['deletions'] = pd.to_numeric(commits['deletions'], errors='coerce')
commits = commits.dropna()
# interpret the UNIX timestamps as real datetimes
commits['timestamp'] = pd.to_datetime(commits['timestamp'], unit="s")
commits.head()
In [ ]:
commits.groupby('filename')[['timestamp']].count().sort_values(by='timestamp', ascending=False).head(10)
In [ ]:
java_commits = commits[commits['filename'].str.endswith(".java")]
java_commits.head()
In [ ]:
java_commits.groupby('author').sum()[['additions']].sort_values(by='additions', ascending=False).head()
In [ ]:
commits[commits['timestamp'].max() == commits['timestamp']]
In [ ]:
java_commits[java_commits['timestamp'].min() == java_commits['timestamp']]
In [ ]:
# keep only commits that aren't dated in the future
commits = commits[commits['timestamp'] <= 'today']
In [ ]:
latest = commits.sort_values(by='timestamp', ascending=False)
latest.head()
In [ ]:
commits['today'] = pd.Timestamp('today')
commits.head()
In [ ]:
# the Git log is ordered newest first, so the last row belongs to the initial commit
initial_commit_date = commits[-1:]['timestamp'].values[0]
initial_commit_date
In [ ]:
# drop commits with broken timestamps that predate the initial commit
commits = commits[commits['timestamp'] >= initial_commit_date]
commits.head()
In [ ]:
# the age is negative here: commit date minus today
commits['age'] = commits['timestamp'] - commits['today']
commits.head()
In [ ]:
commits.groupby('filename')[['age']].min().sort_values(by='age').head(10)
In [ ]:
java_commits.groupby('filename')\
.count()[['additions']]\
.sort_values(by='additions', ascending=False).head()
In [ ]:
# keep the youngest entry per file; multiply by -1 to turn the negative ages into positive ones
ages = commits.sort_values(by='age', ascending=False).drop_duplicates(subset=['filename'])['age'] * -1
ages.head()
In [ ]:
ages.dt.days.hist()
In [ ]:
import glob
import os

# ** only matches files in subdirectories with recursive=True
file_list = [
    os.path.abspath(path).replace(os.sep, "/")
    for path in glob.iglob("../../linux/**/*.*", recursive=True)]
file_list[:5]
In [ ]:
%matplotlib inline
commits.groupby('filename')\
.count()[['additions']]\
.sort_values(by='additions', ascending=False)\
.plot(kind='bar')
In [ ]:
commits.sort_values(by='age', ascending=False).groupby('filename').first().sort_values(by='age', ascending=False)
In [ ]:
%matplotlib inline
commits.groupby('filename')\
.count()['additions']\
.hist(bins=20)
In [ ]:
commits.groupby('filename').count().sort_values(by='additions', ascending=False)
After this, we have to tell Git which information we want. We can do this via the --pretty=format option.
For each commit, we choose to create a header line with the following commit info (by using --pretty=format:'--%h--%ad--%aN'), which gives us the following output:
--fa1ca6f--Thu Dec 22 08:04:18 2016 +0100--feststelltaste
It contains the SHA key, the timestamp as well as the author's name of the commit, separated by --, a separator that certainly doesn't occur within this information.
We also want to have some details about the modifications of each file per commit. This is why we use the --numstat flag.
Together with the --all flag to get all commits and the --no-renames flag to avoid commits that only rename files, we retrieve all the needed information directly via Git.
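Putting these options together, the export command could look like this (a sketch; writing the output to a git.log file is my assumption):
git log --all --no-renames --numstat --pretty=format:'--%h--%ad--%aN' > git.log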
Every other row contains the statistics for one modified file:
2 0 src/main/asciidoc/appendices/bibliography.adoc
It contains the number of lines inserted, the number of lines deleted, and the relative path of the file. With a little trick and a little bit of data wrangling, we can read this information into a nicely structured DataFrame.
The first entries of that file look something like this:
--fa1ca6f--Thu Dec 22 08:04:18 2016 +0100--feststelltaste
2 0 src/main/asciidoc/appendices/bibliography.adoc
Let's get started!
First, I'll show you my approach for reading nearly everything into a DataFrame. The key is to use Pandas' read_csv method for reading "non-character separated values". How does that work? We simply choose a separator that doesn't occur in the file that we want to read. My favorite character for this is the "DEVICE CONTROL TWO" character U+0012. I haven't encountered a situation yet where this character was included in a data set.
We just read our git.log file without any headers (because there are none) and give the only column a nice name.
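A minimal sketch of that read, assuming the export above was written to git.log (the column name raw is my choice):
In [ ]:
import pandas as pd

# U+0012 doesn't occur in the log, so every line ends up in exactly one column
commits_raw = pd.read_csv("git.log",
    sep="\u0012",
    header=None,
    names=['raw'])
commits_raw.head()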
OK, but now we have a data wrangling challenge: the commit info and the statistics for the modified files are in one column, but they don't belong together. What we want is the commit info along with the file statistics in separate columns to get some serious analysis started. A minimal sketch of this wrangling is shown below.
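Here is that sketch, assuming the one-column DataFrame commits_raw from the read above (the regular expression and the new column names are my assumptions):
In [ ]:
# commit header rows start with the '--' separator of the --pretty format
commit_marker = commits_raw['raw'].str.startswith('--')

# split the header rows into their three metadata parts
commit_info = commits_raw[commit_marker]['raw'].str.extract(
    r'^--(?P<sha>.*?)--(?P<date>.*?)--(?P<author>.*)$', expand=True)

# split the --numstat rows into additions, deletions and filename
file_stats = commits_raw[~commit_marker]['raw'].str.split('\t', expand=True) \
    .rename(columns={0: 'additions', 1: 'deletions', 2: 'filename'})

# forward fill the commit metadata onto the file statistics rows and keep only those
commits = commit_info.reindex(commits_raw.index).ffill().join(file_stats, how='inner')
commits.head()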
In [ ]:
commits.groupby('author').sum()[['additions']].sort_values(by='additions', ascending=False)
OK, this part is ready, let's have a look at the file statistics!
We're done!
Just some milliseconds to run through, not bad!
In this notebook, I showed you how to read some imperfectly structured data via the non-character separator trick. I also showed you how to transform rows that contain multiple kinds of data into one nicely structured DataFrame.
Now that we have the Git repository DataFrame, we can do some nice things with it, e.g. visualizing the code churn of a project. But that's a story for another notebook! To give you a short preview:
In [ ]:
%matplotlib inline
timed_commits = java_commits.set_index(
    pd.DatetimeIndex(java_commits['timestamp']))[['additions', 'deletions']] \
    .resample('1D').sum()
timed_commits
(timed_commits['additions'] - timed_commits['deletions']).cumsum().ffill().plot()
In [ ]:
# again, filter out commits with timestamps in the future
c = commits[commits['timestamp'] <= 'today']
c.sort_values(by='timestamp', ascending=False).head()
In [ ]:
# reduce the file-level rows to one row per commit
c = c\
    .groupby('sha')\
    .first()\
    .reset_index()
c.head()
In [ ]:
%matplotlib inline
c.set_index(
pd.DatetimeIndex(c['timestamp'])
)['additions']\
.resample('W-SUN', convention='start')\
.count()\
.tail(500)\
.plot(kind='area', figsize=(100,7))
In [ ]:
c.set_index(
pd.DatetimeIndex(c['timestamp'])
)['additions']\
.resample('W-SUN', convention='start')\
.count()\
.tail(500)
In [ ]:
df = c.set_index(
pd.DatetimeIndex(c['timestamp']))
df2 = df.resample('W').count().dropna()
df2.tail()
In [ ]:
df2['month'] = df2.index.month
df2.head()
In [ ]:
df3 = df2.groupby([df2.index.year, df2.index.month]).aggregate({'month': 'first', 'sha' : 'min'})
df3.head()
In [ ]:
df3.groupby(df3.index).count()
I hope you see how easy it is to retrieve some insights from your version control system by using Python and Pandas for some data wrangling. Because this seems too good to be true, here are the drawbacks of this simple approach:
We observe the file change frequency, not the code change frequency: we count how often a file was touched, not how much of its code actually changed.
We also didn't look up whether the files we are analyzing still exist or have been deleted in the past, which could distort our change analysis.
In [ ]:
%matplotlib inline
commits['author'].value_counts().plot(kind='pie', figsize=(10,10))